Chapter 1 - Assignment 1

This course will help me get familiar with R, RStudio and GitHub. I hope to learn some basics of data science and apply this knowledge to some of the data gathered during my PhD.

My GitHub repository

Everything went smoothly with the first assignment. Installation of the three programs went OK, and everything seems to be running. It is my first time working with GitHub and RStudio. I have been through some of the R Markdown syntax, and the language is quite strange to me. I didn’t quite understand what the Personal Access Token adds in terms of the connection between RStudio and GitHub, given that we are already working with the shared IODS repository.


date()
## [1] "Mon Dec 11 17:50:16 2023"

Chapter 2 - Assignment 2

A full data set was provided for this exercise (https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt). By reading the txt file and exploring the dimensions (dim()) and structure (str()) of the data, we can see that the raw data contains 183 observations and 64 variables.
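
As a sketch, the raw data can be read and inspected like this (same URL as above; reading it remotely of course requires an internet connection):

```r
# read the raw, tab-separated survey data from the course URL
lrn14 <- read.table("https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-data.txt",
                    sep = "\t", header = TRUE)

# dimensions and structure of the raw data
dim(lrn14)
str(lrn14)
```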

We want to group related questions into combination variables. The variables beginning with “D” correspond to the Deep questions, variables beginning with “SU” correspond to the Surface questions, and variables beginning with “ST” correspond to the Strategic questions.

By combining multiple questions into combination variables, we now have 7 variables in total. From the 183 observations, we have excluded 17 observations in which the exam points were marked as zero. So, in summary, we now have 166 observations and 7 variables.

This was done inside the R script “create_learning_2014”. After arranging the data I created the csv file inside the folder ‘data’, which will be used for the analysis.
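
A minimal sketch of what the “create_learning_2014” script does (the question groupings below are as I recall them from the course instructions and should be checked against the exercise material):

```r
library(dplyr)

# questions belonging to each combination variable
deep_questions <- c("D03","D11","D19","D27","D07","D14","D22","D30","D06","D15","D23","D31")
surface_questions <- c("SU02","SU10","SU18","SU26","SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")

# scale attitude back to the 1-5 Likert scale and average each question group
lrn14$attitude <- lrn14$Attitude / 10
lrn14$deep <- rowMeans(lrn14[, deep_questions])
lrn14$surf <- rowMeans(lrn14[, surface_questions])
lrn14$stra <- rowMeans(lrn14[, strategic_questions])

# keep the 7 analysis variables and exclude students with zero exam points
learning2014 <- lrn14 %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  filter(Points > 0)

# write the result to the 'data' folder
write.csv(learning2014, "data/output_learning2014.csv", row.names = FALSE)
```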

library(dplyr)
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
library(ggplot2)
library(tidyverse)
## Warning: package 'readr' was built under R version 4.3.2
#read our csv file and explore its content by checking dimensions and structure.

output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)

#dimensions of our data frame
dim(output_learning2014)
## [1] 166   7
#structure
str(output_learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
#quick overview of the structure and values
head(output_learning2014) 
##   gender Age attitude     deep  stra     surf Points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
#library(GGally)
#library(ggplot2)

#output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)


# create a more advanced plot matrix with ggpairs()
#graphical overview of the data and show summaries of the variables in the data. 

p <- ggpairs(output_learning2014, mapping = aes(col=gender, alpha=0.3),
             lower = list(combo = wrap("facethist", bins = 20)))

#color aesthetic (col) is mapped to the "gender" variable, and alpha is set to 0.3 for transparency.
p

When we use not just pairs() but also the libraries GGally and ggplot2 together, we get a more detailed look at the variables, their distributions and their relationships.

The ggpairs() function gives us a graphical overview of our data together with summaries of the variables. From the matrix chart we can see the distributions of ‘Age’, ‘attitude’, ‘deep’, ‘stra’, ‘surf’ and ‘Points’. For example, for ‘Age’ we can observe that the peak of the distribution is around the age of 20 for females.
Also, the matrix shows the correlation for every pair of these variables (‘Age’ vs ‘attitude’, ‘Age’ vs ‘deep’, ‘Age’ vs ‘stra’, ‘Age’ vs ‘surf’, ‘Age’ vs ‘Points’, and so on through to ‘surf’ vs ‘Points’).

Now we are creating the model for the regression asked in the exercise. I have chosen ‘attitude’, ‘deep’ and ‘stra’ as explanatory variables, and fitted a regression model where exam points (‘Points’) is the target variable.

#library(GGally)
#library(ggplot2)

#output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)

#Points will be the target variable.
#explanatory variables will be attitude, deep and stra
#regression model is the following, taking into consideration the comments above

qplot(attitude + deep + stra, Points , data = output_learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# fitting a linear model
model1 <- lm(Points ~ attitude + deep + stra, data = output_learning2014)

# printing my summary of the model
summary(model1)
## 
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = output_learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

From the R console we can look at the Estimate values to understand what happens to the target variable (whether it goes up or down, and by how much). In this example, a one-unit increase in ‘attitude’ is associated with an increase of 3.5254 in ‘Points’, a one-unit increase in ‘deep’ with a decrease of 0.7492, and a one-unit increase in ‘stra’ with an increase of 0.9621, holding the other variables constant.

Also, we have the Multiple R-squared, which tells us that about 21% of the variation in the dependent variable is explained by these predictors (the adjusted R-squared is 19.5%).
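
The R-squared values can also be pulled out of the summary object directly, for example:

```r
s <- summary(model1)
s$r.squared      # multiple R-squared (~0.21)
s$adj.r.squared  # adjusted R-squared (~0.195)
```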

I have removed the explanatory variable ‘deep’ from the code and fitted the model again.

#library(GGally)
#library(ggplot2)

#output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)

#explanatory variables will be attitude and stra
#Points will be the target variable.
#regression model is the following, taking into consideration the comments above

qplot(attitude + stra, Points , data = output_learning2014) + geom_smooth(method = "lm")

# fitting a linear model
model2 <- lm(Points ~ attitude + stra, data = output_learning2014)

# print out again my summary of the model
summary(model2)
## 
## Call:
## lm(formula = Points ~ attitude + stra, data = output_learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

I have set up a 2x2 grid for the next plots using the function par().

I have produced diagnostic plots using the plot() function with the which argument to specify which diagnostic plots I want, according to the assignment. In this case, which = c(1, 2, 5) produces the Residuals vs Fitted values, Normal QQ-plot, and Residuals vs Leverage plots.

For the first one, the residuals should be independent of each other: there should be no systematic pattern in the residuals over time or across observations. In our case, the residuals are randomly distributed around 0, with essentially no pattern, and their spread is roughly equal across the range of fitted values. This suggests the constant-variance (homoscedasticity) assumption holds.

A Normal Quantile-Quantile (QQ) plot of the residuals can be used to assess normality. The points on the QQ plot don’t deviate significantly from the diagonal line, which suggests the residuals are approximately normally distributed.

With the Residuals vs Leverage plot we look at how the spread of the standardized residuals changes as a function of leverage; it shouldn’t change. Our model tries to minimize the distance of the line from all the points, and points that are far from the line can sometimes have a greater influence over the fit: deleting them can change the model a lot. We can observe that most points are concentrated on the left part of the plot, and none fall outside the Cook’s distance lines, which suggests that no single observation exerts an unduly high influence on the regression coefficients.

#library(GGally)
#library(ggplot2)

#output_learning2014 <- read.csv("data/output_learning2014.csv",sep = ",", header = T)

#my grid
par(mfrow = c(2,2))

#draw of the diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
plot(model2, which=c(1,2,5))


Chapter 3 - Assignment 3

These data concern student achievement in secondary education at two Portuguese schools. The data include student grades, demographic, social and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In the R script ‘create_alc.R’ I have joined the two data sets using all variables other than “failures”, “paid”, “absences”, “G1”, “G2”, “G3” as (student) identifiers. At the end our data set has 370 observations and 35 variables.
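
The join in ‘create_alc.R’ can be sketched roughly as follows, assuming the two raw files have already been read into data frames math and por:

```r
library(dplyr)

# columns that are allowed to differ between the two datasets
free_cols <- c("failures", "paid", "absences", "G1", "G2", "G3")

# all other columns act as (student) identifiers for the join
join_cols <- setdiff(colnames(por), free_cols)

# keep only the students found in both datasets
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```

The duplicated answers (the ‘.math’/‘.por’ column pairs) are then combined into single columns, e.g. by averaging the numeric ones and taking the first value otherwise.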

Because we will use it in the analysis, one important thing to understand is the logical column ‘high_use’, which was created in the wrangling part of the exercise. This column is TRUE when the average of the answers on weekday and weekend alcohol consumption is > 2, and FALSE otherwise.
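
The two alcohol columns were created in the wrangling part roughly like this:

```r
library(dplyr)

# average of weekday (Dalc) and weekend (Walc) alcohol consumption
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)

# TRUE when the average consumption is higher than 2, FALSE otherwise
alc <- mutate(alc, high_use = alc_use > 2)
```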

For the exercise I will choose some of the variables, which will be explained below.

library(tidyr)
library(readr)
library(dplyr)
library(ggplot2)


alc <- read.csv("data/alc.csv",sep = ",", header = T)

variables_names <- colnames(alc)
variables_names
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

The variables of interest to us are ‘goout’, going out with friends (numeric: from 1 - very low to 5 - very high); ‘failures’, the number of past class failures (numeric: n if 1<=n<3, else 4); ‘absences’, the number of school absences (numeric: from 0 to 93); and ‘G3’, the final grade (numeric: from 0 to 20, output target).

For each variable we want to analyse its relationship with alcohol consumption, so for each variable we can formulate one a priori personal hypothesis. Higher levels of going out with friends (‘goout’) might be associated with higher alcohol consumption: students who go out more frequently may drink more. Students with a higher number of past class failures (‘failures’) may exhibit higher alcohol consumption. A higher number of school absences (‘absences’) may be associated with higher alcohol consumption: students who are frequently absent might have more opportunities for social activities, including drinking. Finally, there might be an inverse relationship between the final grade (‘G3’) and alcohol consumption: students with lower grades might be more likely to engage in heavy drinking.

I have analysed the data for our 4 variables using cross-tabulations, bar plots and box plots. The results turned out to be a bit different from the hypotheses. We can conclude that high use of alcohol is more common when students go out more often. The students who report high alcohol use are mostly the ones with zero class failures, and the number of high-use students decreases with the number of failures. Students with more class absences show less high alcohol use. Students with middle-range grades are the ones with the most alcohol consumption.

alc %>% group_by(failures) %>% summarise(count = n()) 
## # A tibble: 4 × 2
##   failures count
##      <int> <int>
## 1        0   325
## 2        1    24
## 3        2    17
## 4        3     4
alc %>% group_by(absences) %>% summarise(count = n()) 
## # A tibble: 26 × 2
##    absences count
##       <int> <int>
##  1        0    63
##  2        1    50
##  3        2    56
##  4        3    38
##  5        4    35
##  6        5    22
##  7        6    21
##  8        7    12
##  9        8    20
## 10        9    11
## # ℹ 16 more rows
alc %>% group_by(G3) %>% summarise(count = n()) 
## # A tibble: 18 × 2
##       G3 count
##    <int> <int>
##  1     0     2
##  2     2     1
##  3     3     1
##  4     4     6
##  5     5     5
##  6     6    21
##  7     7     5
##  8     8    26
##  9     9     7
## 10    10    63
## 11    11    25
## 12    12    78
## 13    13    25
## 14    14    38
## 15    15    16
## 16    16    37
## 17    17     7
## 18    18     7
alc %>% group_by(goout) %>% summarise(count = n()) 
## # A tibble: 5 × 2
##   goout count
##   <int> <int>
## 1     1    22
## 2     2    97
## 3     3   120
## 4     4    78
## 5     5    53
#alc %>%
#  group_by(high_use, sex) %>%
#  summarise(failures = median(failures))


#summary(alc$address)

#cross-tabulations


# Cross-tabulation of 'goout' with 'alc_use'
cross_tab_goout <- table(alc$goout, alc$high_use)
print("Cross-tabulation of 'goout' with 'high_use'")
## [1] "Cross-tabulation of 'goout' with 'high_use'"
cross_tab_goout
##    
##     FALSE TRUE
##   1    19    3
##   2    82   15
##   3    97   23
##   4    40   38
##   5    21   32
# Cross-tabulation of 'failures' with 'alc_use'
cross_tab_failures <- table(alc$failures, alc$high_use)
print("Cross-tabulation of 'failures' with 'high_use'")
## [1] "Cross-tabulation of 'failures' with 'high_use'"
cross_tab_failures
##    
##     FALSE TRUE
##   0   238   87
##   1    12   12
##   2     8    9
##   3     1    3
# Cross-tabulation of 'absences' with 'alc_use'
cross_tab_absences <- table(alc$absences, alc$high_use)
print("Cross-tabulation of 'absences' with 'high_use'")
## [1] "Cross-tabulation of 'absences' with 'high_use'"
cross_tab_absences
##     
##      FALSE TRUE
##   0     50   13
##   1     37   13
##   2     40   16
##   3     31    7
##   4     24   11
##   5     16    6
##   6     16    5
##   7      9    3
##   8     14    6
##   9      5    6
##   10     5    2
##   11     2    3
##   12     3    4
##   13     1    1
##   14     1    6
##   16     0    1
##   17     0    1
##   18     1    1
##   19     0    1
##   20     2    0
##   21     1    1
##   26     0    1
##   27     0    1
##   29     0    1
##   44     0    1
##   45     1    0
# Cross-tabulation of 'G3' with 'alc_use'
cross_tab_G3 <- table(alc$G3, alc$high_use)
print("Cross-tabulation of 'G3' with 'high_use'")
## [1] "Cross-tabulation of 'G3' with 'high_use'"
cross_tab_G3
##     
##      FALSE TRUE
##   0      1    1
##   2      0    1
##   3      1    0
##   4      5    1
##   5      3    2
##   6     15    6
##   7      3    2
##   8     18    8
##   9      3    4
##   10    38   25
##   11    16    9
##   12    51   27
##   13    18    7
##   14    31    7
##   15    15    1
##   16    29    8
##   17     5    2
##   18     7    0
#bar plots 
# Bar plot for 'goout' and its relationship with 'alc_use'
ggplot(data=alc, aes(x = goout, fill = high_use)) +
  geom_bar() +  
  labs(x = "Go Out with Friends", y = "Count", fill = "Alcohol Use") 

# Bar plot for 'failures' and its relationship with 'alc_use'
ggplot(alc, aes(x = failures, fill = high_use)) +
  geom_bar() +
  labs(x = "Number of Failures", y = "Count", fill = "Alcohol Use") 

#for the last two plots, I've tried a different way of plotting the bars

# Bar plot for 'absences' and its relationship with 'alc_use'
ggplot(alc, aes(x = absences, fill = high_use)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(x = "Number of Absences", y = "Count", fill = "Alcohol Use") 

# Bar plot for 'G3' and its relationship with 'alc_use'
ggplot(alc, aes(x = G3, fill = high_use)) +
  geom_bar(position = "dodge", stat = "count") +
  labs(x = "Final Grade (G3)", y = "Count", fill = "Alcohol Use")

#boxplots

g_failures <- ggplot(alc, aes(x = high_use, y = failures, col = sex))
g_failures + geom_boxplot() + ylab("failures")

g_g3 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g_g3 + geom_boxplot() + ylab("grade")

g_absences <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g_absences + geom_boxplot() + ylab("absences")

g_goout <- ggplot(alc, aes(x = high_use, y = goout, col = sex))
g_goout + geom_boxplot() + ylab("going_out")

I’ve used logistic regression to explore the relationship between my variables and the high/low alcohol consumption variable as the target variable.

For the variable ‘failures’, each one-unit increase in the number of failures increases the log-odds of high alcohol consumption (the target variable) by 0.52078. The p-value (0.028128) is less than 0.05, indicating that the coefficient for failures is statistically significant.

For absences, each additional absence increases the log-odds of the response variable by 0.07373. The p-value is also less than 0.05 (0.000958), indicating that the coefficient for absences is statistically significant.

For the final grade (G3), each one-unit increase in the G3 variable decreases the log-odds of the target variable by 0.01612. Here, the p-value (0.699253) is greater than 0.05, indicating that the coefficient for G3 is not statistically significant; the variable G3 is not contributing significantly to the model.

For the variable of going out with friends, each one-unit increase in goout increases the log-odds of the target variable by 0.69791. The z-value (5.874) corresponds to a very small p-value (4.25e-09), which is highly statistically significant.

The second part of the code fits a logistic regression model, extracts the coefficients, computes the odds ratios, calculates confidence intervals, and then prints out the odds ratios with their confidence intervals. The confint() function is used to obtain the confidence intervals for the coefficients, and the exp() function is applied to exponentiate the values. The cbind() function is used to combine the odds ratios and confidence intervals into a single matrix for printing.

For failures, each unit increase in the number of failures multiplies the odds of the response variable by a factor of 1.6833396 (in short, an increase in the number of failures is associated with an increase in the odds of high alcohol use). The 95% confidence interval for the odds ratio is 1.062542815 - 2.7040414.

For absences, each unit increase in the number of absences multiplies the odds of the response variable by a factor of 1.0765213 (in short, an increase in the number of absences is associated with a slight increase in the odds of high alcohol use). The 95% confidence interval for the odds ratio is 1.031833703 - 1.1275904. For the final grade variable the odds ratio is 0.9840135, i.e. each extra grade point multiplies the odds of high use by 0.9840135; the 95% confidence interval is 0.906930755 - 1.0685440. For the going-out variable the odds ratio is 2.0095429, so each extra unit of goout roughly doubles the odds of high use; the 95% confidence interval is 1.600845280 - 2.5531997.

Regarding the intercept, the odds of high alcohol use when all predictors are zero are 0.0326675, which indicates a low baseline likelihood of high alcohol use. The 95% confidence interval for this odds ratio is 0.007953009 - 0.1220042.

# Explore the dataset
str(alc)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
# Fit the logistic regression model
model <- glm(high_use ~ failures + absences + G3 + goout, 
             data = alc, 
             family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + absences + G3 + goout, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.42137    0.69467  -4.925 8.43e-07 ***
## failures     0.52078    0.23720   2.195 0.028128 *  
## absences     0.07373    0.02233   3.303 0.000958 ***
## G3          -0.01612    0.04171  -0.386 0.699253    
## goout        0.69791    0.11881   5.874 4.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 383.91  on 365  degrees of freedom
## AIC: 393.91
## 
## Number of Fisher Scoring iterations: 4
coef(model)
## (Intercept)    failures    absences          G3       goout 
## -3.42137467  0.52077970  0.07373480 -0.01611567  0.69790726
# compute odds ratios (OR)
OR <- coef(model) %>% exp

# compute confidence intervals (CI)
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
result <- cbind(OR, CI)
print(result)
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0326675 0.007953009 0.1220042
## failures    1.6833396 1.062542815 2.7040414
## absences    1.0765213 1.031833703 1.1275904
## G3          0.9840135 0.906930755 1.0685440
## goout       2.0095429 1.600845280 2.5531997
#model2 <- glm(high_use ~ failures + absences + G3 + goout -1,  data = alc,  family = "binomial")
# summary(model2)
#coef(model2)

The next part provides insights into the performance of the logistic regression model, including accuracy, proportions, and inaccuracy metrics.

The scatter plot visually represents the predicted probabilities against the actual values. The points are colored based on whether the prediction is above (TRUE) or below (FALSE) the threshold of 0.5.

I computed a 2x2 confusion matrix to compare actual values and predictions. The confusion matrix provides counts of true positives, true negatives, false positives, and false negatives. For example, there are 252 true negatives (actual: FALSE, predicted: FALSE) and 33 true positives (actual: TRUE, predicted: TRUE).

Adding margins to the proportional confusion matrix gives us the row and column sums, providing an overview of the distribution of actual and predicted values. It shows that 70% of the observations actually have high_use = FALSE and 30% have high_use = TRUE, while about 89% of the observations are predicted as FALSE and 11% as TRUE.

I’ve defined a loss function to compute the average number of wrong predictions. It shows that if we predict all observations as 0 (using a threshold of 0.5), the average inaccuracy is 0.3 or 30%; if we predict all observations as 1, the average inaccuracy is 0.7 or 70%.

I’ve then changed the prob argument of the loss function to the prediction probabilities stored in alc, which shows that the average inaccuracy is about 22.97% when using the actual predicted probabilities from the model.

m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))



# add the aesthetic element col = prediction and draw the plot again
g + geom_point(aes(col = prediction))

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# adjust the code: use %>% to apply the prop.table() function on the output of table()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.68108108 0.01891892
##    TRUE  0.21081081 0.08918919
# adjust the code: use %>% to apply the addmargins() function on the output of prop.table()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.21081081 0.08918919 0.30000000
##    Sum   0.89189189 0.10810811 1.00000000
#innacuracy 
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}


# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
# adjust the code: change the prob argument in the loss function to prob = 1
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
# adjust the code again: change the prob argument by giving it the prediction probabilities in alc (the column probability)
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

Bonus and Super Bonus

Cross-validation provides a more robust estimate of a model’s performance by assessing its generalization to unseen data. The inaccuracy values give an idea of how well the model is performing in terms of prediction errors; lower values are generally desirable. On the training data, the average inaccuracy (wrong predictions) is approximately 22.97%. In leave-one-out cross-validation, where each observation is left out once, the average inaccuracy rises to approximately 24.05% [0.2405405].

In 10-fold cross-validation, the average inaccuracy on the testing data is approximately 24.32%. This means that, on average, the model incorrectly predicts the outcome for about 24.32% of the observations when the dataset is split into 10 folds and the model is trained and tested on different subsets in each iteration.

The results suggest that the model’s performance is relatively consistent between leave-one-out cross-validation and 10-fold cross-validation. Both estimates of average inaccuracy are in a similar range, indicating that the model makes incorrect predictions for around 23-24% of the observations.

library(boot)
## Warning: package 'boot' was built under R version 4.3.2
m <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}


loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
# 10-fold cross validation. 
cv_10fold <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv_10fold$delta[1]
## [1] 0.2432432
# Fit the initial logistic regression model with a high number of predictors
full_model <- glm(high_use ~ ., 
                  data = alc, 
                  family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Choose an initial set of predictors
initial_predictors <- c("failures", "absences", "G3", "goout", "sex")

# Create a list to store the results
cv_results_list <- list()

# Perform cross-validation with different sets of predictors
for (i in length(initial_predictors):1) {
  current_predictors <- initial_predictors[1:i]
  
  # Fit the logistic regression model with the current set of predictors
  current_model <- glm(high_use ~ ., 
                        data = select(alc, all_of(c(current_predictors, "high_use"))), 
                        family = "binomial")
  
  # Perform K-fold cross-validation
  cv_results <- cv.glm(data = select(alc, all_of(c(current_predictors, "high_use"))), 
                       cost = loss_func, glmfit = current_model, K = nrow(alc))
  
  # Store the results
  cv_results_list[[i]] <- cv_results
}

# Extract the two cross-validation error estimates returned by cv.glm
# (delta[1] is the raw CV estimate, delta[2] the bias-adjusted one;
# these are not training vs. testing errors)
raw_errors <- sapply(cv_results_list, function(x) x$delta[1])
adjusted_errors <- sapply(cv_results_list, function(x) x$delta[2])

# Create a data frame for plotting
results_df <- data.frame(
  Predictors = length(initial_predictors):1,
  Raw_Error = raw_errors,
  Adjusted_Error = adjusted_errors
)

# Plot both cross-validation error estimates by the number of predictors
ggplot(results_df, aes(x = Predictors)) +
  geom_line(aes(y = Raw_Error, color = "Raw CV Error")) +
  geom_line(aes(y = Adjusted_Error, color = "Adjusted CV Error")) +
  scale_color_manual(values = c("Raw CV Error" = "blue", "Adjusted CV Error" = "red")) +
  labs(x = "Number of Predictors", y = "Error Rate", title = "Cross-Validation Error by Number of Predictors") +
  theme_minimal()


Chapter 4 - Assignment 4

Analysis Exercise

I’ve loaded the data from the MASS package. The exercise didn’t mention the dataset by name, so I assumed we would be using the Boston dataset from MASS, as we did in Exercise 4.

I have explored the structure and dimensions of the data: this dataset has 506 observations and 14 variables. I used corrplot to visualize the correlations between the variables of the Boston dataset. The 14 variables are as follows:

library(tidyr)
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
# load the data from the MASS package
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#checking if I can extract some valuable info from pairs and ggplots
pairs(Boston)

p_boston <- ggpairs(Boston, mapping = aes(alpha=0.3),
             lower = list(combo = wrap("facethist", bins = 20)))

p_boston

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 


# visualize the correlation matrix
corrplot(cor_matrix, method="circle")

print(cor_matrix)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000

By examining the correlation matrix and the corrplot, we can gain insights into the relationships between our variables. Positive correlations (values closer to 1) indicate a positive relationship, negative correlations (values closer to -1) indicate a negative relationship, and values close to 0 suggest a weak or no linear relationship.
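As a sketch of how the strongest relationships could be pulled out programmatically (the 0.7 cutoff here is an arbitrary choice of mine, not part of the exercise):

```r
library(MASS)

data("Boston")
cor_matrix <- cor(Boston)

# keep each pair once: blank out the upper triangle and the diagonal
cor_matrix[upper.tri(cor_matrix, diag = TRUE)] <- NA

# index the pairs whose absolute correlation exceeds the 0.7 cutoff
strong <- which(abs(cor_matrix) > 0.7, arr.ind = TRUE)

data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           r    = round(cor_matrix[strong], 2))
```

In the Boston data this picks out, among others, the rad-tax pair (r ≈ 0.91) that stands out in the corrplot.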

# calculate the correlation matrix and round it using the pipe (%>%)
cor_matrix <- cor(Boston) %>% round(2)

# print the rounded correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the rounded correlation matrix with corrplot
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The Boston data contains only numerical values, so we can use the function ‘scale()’ to standardize the whole dataset. After scaling, each variable has a mean of 0 and a standard deviation of 1, so the minimum, maximum, and quartile values are on a standardized scale. I then created a categorical variable ‘crime’ from the continuous crime rate, cutting it into four groups at the quantiles, and split the dataset into training and testing sets. There are 127 observations in the category [-0.419, -0.411], 126 in (-0.411, -0.39], 126 in (-0.39, 0.00739], and 127 in (0.00739, 9.92].
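A small sanity check of what scale() does: by default it subtracts each column’s mean and divides by its standard deviation, so the standardized crim column can be reproduced by hand.

```r
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)

# reproduce the standardization of one column manually
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.vector(boston_scaled[, "crim"]), manual)
## [1] TRUE

# after scaling, every column has mean 0 and standard deviation 1
all(abs(colMeans(boston_scaled)) < 1e-10)
## [1] TRUE
all(abs(apply(boston_scaled, 2, sd) - 1) < 1e-10)
## [1] TRUE
```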

# center and standardize variables
boston_scaled <-  scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##             127             126             126             127
# adjust the cut() function with labels
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = labels, include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

The LDA model has been trained on the training data and provides information on how predictor variables contribute to the classification of crime groups. LD1 explains the majority of the variance. The confusion matrix indicates how well the model performs on the test dataset, with a count of correct and incorrect predictions for each crime group.

The probabilities of each group occurring in the training data, before considering any predictors, can be read from the prior probabilities of groups. We have a 25.50% probability of the crime group being ‘low’, 23.76% for ‘med_low’, 24.26% for ‘med_high’ and 26.49% for ‘high’.

For the group means, the values represent the mean standardized value of each predictor variable within each crime group. For example, in the “low” crime group, the average value of zn is approximately 0.91. The coefficients of linear discriminants represent the loadings of each predictor variable on the linear discriminant functions (LD1, LD2, LD3). A positive coefficient means a positive association with the corresponding discriminant function.

The proportion of trace shows the share of the total between-group variance explained by each discriminant function. In our case, LD1 explains approximately 95.52% of the variance, LD2 explains 3.52% and LD3 explains 0.97%.

The confusion matrix compares the predicted classes (low, med_low, med_high, high) with the correct classes in the test dataset. The diagonal represents correct predictions, and the off-diagonal elements represent misclassifications.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2376238 0.2425743 0.2648515 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.91001329 -0.9045760 -0.11943197 -0.8912890  0.42864231 -0.8794488
## med_low  -0.05534236 -0.4014473  0.05576262 -0.5822165 -0.11382890 -0.2947464
## med_high -0.38048487  0.1498044  0.24993933  0.3310919  0.09471629  0.4039239
## high     -0.48724019  1.0170108 -0.05155709  1.0700640 -0.40505275  0.8230892
##                 dis        rad        tax     ptratio       black        lstat
## low       0.8565111 -0.6908519 -0.7299494 -0.49921640  0.37510322 -0.752921042
## med_low   0.3723848 -0.5428218 -0.4838676 -0.07761562  0.34723829 -0.137211264
## med_high -0.3421943 -0.4064658 -0.3187235 -0.28252722  0.05280657  0.009716446
## high     -0.8587964  1.6392096  1.5148289  0.78203563 -0.72812966  0.869150388
##                 medv
## low       0.52720367
## med_low   0.00322859
## med_high  0.15974969
## high     -0.67952137
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.127652475  0.61330576 -0.93313312
## indus    0.003009037 -0.32700725 -0.08118864
## chas    -0.072328157 -0.10289707  0.14673923
## nox      0.438322288 -0.71558302 -1.30384993
## rm      -0.090592242 -0.07057900 -0.16019911
## age      0.252793251 -0.36813052  0.01755247
## dis     -0.058715607 -0.22460981  0.09165474
## rad      3.116547103  0.95157307 -0.19786872
## tax      0.027949022  0.10588212  0.81643568
## ptratio  0.138673309 -0.03234884 -0.20249578
## black   -0.106727098  0.02453966  0.19470297
## lstat    0.247987482 -0.18164251  0.30659811
## medv     0.208889605 -0.28087811 -0.26752357
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9552 0.0352 0.0097
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results 
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       7        2    0
##   med_low    5      18        7    0
##   med_high   0       6       21    1
##   high       0       0        0   20
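From a confusion matrix like the one above, the overall test accuracy is the sum of the diagonal divided by the total. A small sketch using the counts printed above (these counts change between runs, since the train/test split is random and no seed was set):

```r
# the cross-tabulation above, re-entered as a plain matrix
conf <- matrix(c(15,  7,  2,  0,
                  5, 18,  7,  0,
                  0,  6, 21,  1,
                  0,  0,  0, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correct predictions divided by all predictions
sum(diag(conf)) / sum(conf)
## [1] 0.7254902
```

With the fitted objects still in memory, the same figure comes directly from mean(correct_classes == lda.pred$class).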

I calculated the total within-cluster sum of squares (TWCSS) for different numbers of clusters and plotted the results to find the elbow point. To visualize the clusters, I ran the k-means algorithm again with the optimal number of clusters and plotted them with the pairs() function, which shows the clusters in a scatterplot matrix. The color of each point represents its cluster membership; each point corresponds to one observation, grouped according to the k-means clustering. The clustering helps us identify groups of observations that are similar in terms of the standardized variables: observations within each cluster are similar to each other, and there is a clear separation between clusters.

data("Boston")

# Standardize the dataset
boston_scaled <- scale(Boston)

# Calculate distances between observations
distances <- dist(boston_scaled)

k_means_result <- kmeans(boston_scaled, centers = 3)

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

plot(1:k_max, twcss, type = 'b', main = 'Total WCSS vs. Number of Clusters', xlab = 'Number of Clusters', ylab = 'Total WCSS')

# Run k-means again with the optimal number of clusters
optimal_k <- 3
k_means_optimal <- kmeans(boston_scaled, centers = optimal_k)

# Visualize the clusters with pairs() function, using first 5 variables
pairs(boston_scaled[, 1:5], col = k_means_optimal$cluster)
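One caveat worth noting: kmeans() starts from randomly chosen centers, so the cluster assignments and the TWCSS curve can change between runs. A minimal sketch of making the clustering reproducible with a seed and several restarts (the seed value and nstart = 25 are arbitrary choices of mine):

```r
library(MASS)

data("Boston")
boston_scaled <- scale(Boston)

# with a fixed seed and several random restarts the result is repeatable
set.seed(123)
km1 <- kmeans(boston_scaled, centers = 3, nstart = 25)
set.seed(123)
km2 <- kmeans(boston_scaled, centers = 3, nstart = 25)
identical(km1$cluster, km2$cluster)
## [1] TRUE
```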

Bonus

# k-means clustering
km <- kmeans(Boston, centers = 4)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 4

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Super Bonus

Given code in the exercise:

library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

Modified code:

The dimensions of the data were causing the issue: matrix_product has 404 rows (the training set), but a color vector built from the full Boston data would have a length of 506. We need to make sure that the vectors used for x, y, and z have the same length as the color vector.

library(plotly)

#set.seed(123)
#k_max <- 3  
#km <- kmeans(Boston, centers = k_max)
model_predictors <- dplyr::select(train, -crime)

# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Check the dimensions
dim(matrix_product)
## [1] 404   3
length(train$crime)
## [1] 404
# Plot 3D scatter plot with color based on crime classes
plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3, 
        type = 'scatter3d', mode = 'markers', color = as.factor(train$crime),
        marker = list(size = 5))  # Adjust marker size as needed

Chapter 5 - Assignment 5

Analysis Exercise

library(dplyr)
library(ggplot2)
library(GGally)
library(readr)
library(corrplot)
library(tibble)
library(tidyr)

#mine 
human <- read_csv("data/human2.csv", show_col_types = FALSE)
#professor
#human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv", show_col_types = FALSE)

Point 1

#%keep <- c("Country", "Edu2.FM", "Labo.FM", "Life.Exp", "Edu.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
#human <- select(human, one_of(keep))
#human <- filter(human, complete.cases(human))
#last <- nrow(human) - 7
#human <- human[1:last, ]


#move country name to rownames
human_ <- column_to_rownames(human, "Country")


str(human_)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : num  64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)

# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot()

#ggplot(human,aes(Life.Exp,GNI, label=Country)) + geom_point() + geom_text() + theme_bw()

After moving the country names to the rownames, we can print a summary of our data. I used ggpairs and corrplot to see the relationships between the variables. We can observe how the variables correlate with each other, and the plots help us see the central tendency, variability, and distribution of each variable in the dataset.

The R console gives us the range of values, the central tendency (mean, median), and the spread of the data for each variable. For example, for the variable “Edu2.FM” the minimum value is 0.1717 and the maximum is 1.4967. The median is 0.9375, which indicates that the middle value of the data is close to this point.

From the plots we can conclude that our variables are highly correlated.

The colors in the corrplot represent the strength and direction of the correlation between pairs of variables, as shown in the scale (values around zero mean little or no correlation). Circles with a more intense color and a larger diameter indicate a stronger correlation. The blue circles represent positive correlations: for example, if the variable Edu.Exp increases, then Life.Exp also tends to increase. On the other hand, the red circles mean a negative correlation: for example, if Life.Exp increases, then Mat.Mor tends to decrease.

Point 2 - PCA without standardizing

human <- column_to_rownames(human, "Country")
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# perform principal component analysis 
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
#s <- summary(pca_human)
#pca_pr <- round(1*s$importance[2, ], digits = 5)
#print(pca_pr)
#paste0(names(pca_pr), " (", pca_pr, "%)")
#pc_lab = paste0(names(pca_pr), " (", pca_pr, "%)")
#biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])

Point 3 and 4 - PCA with standardizing and interpretation

#human <- column_to_rownames(human, "Country")
human_std <- scale(human)
pca_human <- prcomp(human_std)

# create and print out a summary of pca_human
s <- summary(pca_human)


# rounded proportions of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 5)

# print out the percentages of variance
print(pca_pr)
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
# create object pc_lab to be used as axis labels
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (0.53605%)" "PC2 (0.16237%)" "PC3 (0.09571%)" "PC4 (0.07583%)"
## [5] "PC5 (0.05477%)" "PC6 (0.03595%)" "PC7 (0.02634%)" "PC8 (0.01298%)"
pc_lab = paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])

We can see that without standardizing the variables, the PCA is not very useful. The principal components maximize variance, and in the biplot of the unstandardized data the axes related to GNI dominate because GNI takes very large values, so the variables are not comparable. GNI has such a large variance that the variables with small values contribute almost nothing. This is why it is so important to put the variables on the same scale (standardization), as we did with the function scale(), so that they can be compared on an equal footing in the same analysis.

By doing the PCA we can see how important these principal components are and whether our variables are highly correlated. In the biplot, the scales above and on the right refer to the original values of the variables, roughly in the range -10 to 10, while the principal component scores are on the bottom and left axes; the variance along the horizontal axis (PC1) is maximized. We can conclude that Labo.F and Parli.F are related to the second principal component and are largely independent of the other variables, which load on the PC1 axis.

From the PCA we can conclude that 53.605% of the variance in the dataset is summarized by one component, PC1, and that a further 16.237% is explained by another, PC2. Our variables Parli.F and Labo.F are highly correlated with each other and explain around 16% of the variance of the original dataset.
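The effect of standardization can be checked on a small synthetic example (hypothetical variable names, not the human data): when one variable has a much larger scale, unscaled PCA is dominated by it, while scaling the variables first shares the variance between the components.

```r
# Sketch with synthetic data: one large-scale variable, analogous to GNI,
# and one small-scale variable.
set.seed(1)
x <- data.frame(gni = rnorm(100, sd = 10000),
                edu = rnorm(100, sd = 1))

pca_raw <- prcomp(x)         # no standardization
pca_std <- prcomp(scale(x))  # standardized first, as with scale() above

# Without scaling, PC1 captures essentially all of the variance
# (it just points along the large-scale variable):
summary(pca_raw)$importance[2, "PC1"]  # close to 1
# With scaling, the variance is split roughly evenly:
summary(pca_std)$importance[2, "PC1"]  # close to 0.5
```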

Point 5

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# View(tea)  # opens the interactive data viewer in RStudio; not useful in a knitted document

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + 
  facet_wrap("name", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

#MCA
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

The contributions of the variable categories (in %) to the definition of the dimensions can be extracted in the R console from the table containing the three dimensions. The dimensions (Dim.1, Dim.2, Dim.3) capture different patterns or structures in the categorical data; these patterns represent relationships or associations between categories of the variables. For example, in Dim.1 the variables "how" and "where" have relatively high squared correlations (eta2 of 0.708 and 0.702, respectively), suggesting that they are important in explaining the pattern represented by Dim.1.

The bar plots suggest that most individuals prefer to drink their tea alone (without milk or lemon) and outside lunch hours. Tea bags are the most common format, Earl Grey is the most popular type, and most people buy their tea in a chain store. The split between individuals who take sugar and those who do not is nearly even.

plot.MCA() generates a variable biplot, which displays the relationships between the variable categories and helps us identify the categories most correlated with each dimension.

The axes of the plot represent the dimensions (factors) extracted by the MCA: Dimension 1 on the horizontal axis and Dimension 2 on the vertical axis. The percentage of variability explained by each dimension is shown: 15.24% for Dimension 1 and 14.23% for Dimension 2. The distance of a point from the origin reflects the strength of its association with the dimensions; categories farther from the origin have a stronger influence on the variability in the data. Along Dimension 1 we see on the map that 'tea shop' and 'unpackaged' lie furthest from the origin and are therefore the most important (the most correlated with Dimension 1).
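As a sketch of how those contribution figures can be pulled out programmatically: the MCA result object stores the category contributions in `mca$var$contrib` and the variable eta2 values in `mca$var$eta2`. The snippet below rebuilds the same `mca` object from the copy of the tea data bundled with FactoMineR, so it runs without the csv file.

```r
library(FactoMineR)

# The same questionnaire data ships with FactoMineR
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# Category contributions (%) to the first three dimensions;
# each column sums to 100
round(mca$var$contrib[, 1:3], 2)

# Squared correlations (eta2) of the variables with each dimension,
# i.e. the "Categorical variables" table from summary(mca)
round(mca$var$eta2[, 1:3], 3)
```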


Chapter 6 - Assignment 6

Analysis Exercise

Part 1, using RATS data

# Read RATS dataset
RATS <- read_csv("data/RATS_long.csv", show_col_types = FALSE)

RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)


# Plot the RATS data
ggplot(RATS, aes(x = Time, y = Weight, group = ID, linetype = Group)) +
  geom_line() +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") +
  theme(legend.position = "top")

# Create a regression model RATS_reg
RATS_reg <- lm(Weight ~ Time + Group, data = RATS)

# Print out a summary of the model
summary(RATS_reg)
## 
## Call:
## lm(formula = Weight ~ Time + Group, data = RATS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.643 -24.017   0.697  10.837 125.459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 244.0689     5.7725  42.281  < 2e-16 ***
## Time          0.5857     0.1331   4.402 1.88e-05 ***
## Group2      220.9886     6.3402  34.855  < 2e-16 ***
## Group3      262.0795     6.3402  41.336  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.34 on 172 degrees of freedom
## Multiple R-squared:  0.9283, Adjusted R-squared:  0.9271 
## F-statistic: 742.6 on 3 and 172 DF,  p-value: < 2.2e-16
# Access library lme4
library(lme4)
## Warning: package 'lme4' was built under R version 4.3.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model ref and ref1
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATS, REML = FALSE)
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATS, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# Print the summary of the model
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
##    Data: RATS
## 
##      AIC      BIC   logLik deviance df.resid 
##   1333.2   1352.2   -660.6   1321.2      170 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.5386 -0.5581 -0.0494  0.5693  3.0990 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 1085.92  32.953  
##  Residual               66.44   8.151  
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 244.06890   11.73107   20.80
## Time          0.58568    0.03158   18.54
## Group2      220.98864   20.23577   10.92
## Group3      262.07955   20.23577   12.95
## 
## Correlation of Fixed Effects:
##        (Intr) Time   Group2
## Time   -0.090              
## Group2 -0.575  0.000       
## Group3 -0.575  0.000  0.333
# Print a summary of the model
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
##    Data: RATS
## 
##      AIC      BIC   logLik deviance df.resid 
##   1194.2   1219.6   -589.1   1178.2      168 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2259 -0.4322  0.0555  0.5637  2.8825 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  ID       (Intercept) 1139.2004 33.7520       
##           Time           0.1122  0.3349  -0.22
##  Residual               19.7479  4.4439       
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 246.46821   11.80888  20.871
## Time          0.58568    0.08548   6.852
## Group2      214.55803   20.16986  10.638
## Group3      258.91288   20.16986  12.837
## 
## Correlation of Fixed Effects:
##        (Intr) Time   Group2
## Time   -0.166              
## Group2 -0.569  0.000       
## Group3 -0.569  0.000  0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# Perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATS
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## RATS_ref     6 1333.2 1352.2 -660.58   1321.2                         
## RATS_ref1    8 1194.2 1219.6 -589.11   1178.2 142.94  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATS, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00504145 (tol = 0.002, component 1)
# Print a summary of the model
summary(RATS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Weight ~ Time * Group + (Time | ID)
##    Data: RATS
## 
##      AIC      BIC   logLik deviance df.resid 
##   1185.9   1217.6   -582.9   1165.9      166 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.4249  0.0726  0.6034  2.7511 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. Corr 
##  ID       (Intercept) 1.106e+03 33.2534       
##           Time        4.925e-02  0.2219  -0.15
##  Residual             1.975e+01  4.4439       
## Number of obs: 176, groups:  ID, 16
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 251.65165   11.79473  21.336
## Time          0.35964    0.08215   4.378
## Group2      200.66549   20.42907   9.823
## Group3      252.07168   20.42907  12.339
## Time:Group2   0.60584    0.14229   4.258
## Time:Group3   0.29834    0.14229   2.097
## 
## Correlation of Fixed Effects:
##             (Intr) Time   Group2 Group3 Tm:Gr2
## Time        -0.160                            
## Group2      -0.577  0.092                     
## Group3      -0.577  0.092  0.333              
## Time:Group2  0.092 -0.577 -0.160 -0.053       
## Time:Group3  0.092 -0.577 -0.053 -0.160  0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00504145 (tol = 0.002, component 1)
# Perform an ANOVA test on the two models
anova(RATS_ref2, RATS_ref1)
## Data: RATS
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time * Group + (Time | ID)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## RATS_ref1    8 1194.2 1219.6 -589.11   1178.2                        
## RATS_ref2   10 1185.9 1217.6 -582.93   1165.9 12.361  2    0.00207 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Draw the plot of RATS with the observed Weight values
ggplot(RATS, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Observed weight (grams)") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(RATS_ref2)

# Create a new column Fitted to RATS
RATS <- RATS %>%
  mutate(Fitted = Fitted)

# Draw the plot of RATS with the Fitted values of weight
ggplot(RATS, aes(x = Time, y = Fitted, group = ID)) +
  geom_line(aes(linetype = Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Fitted weight (grams)") +
  theme(legend.position = "top")


Part 2, using BPRS data

library(dplyr)
library(ggplot2)
library(readr)
library(tibble)
library(tidyr) 

# Load the BPRS dataset
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep = " ", header = TRUE)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# Pivot the BPRS data
BPRSL <- pivot_longer(BPRS, cols = -c(treatment, subject), names_to = "weeks", values_to = "bprs") %>% arrange(weeks)
BPRSL$week <- as.integer(substr(BPRSL$weeks, 5, 5))

# Standardize the variable bprs
BPRSL <- BPRSL %>%
  group_by(week) %>%
  mutate(stdbprs = (bprs - mean(bprs)) / sd(bprs)) %>%
  ungroup()

# Draw the plot with observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

# Create a vector of the fitted values
BPRSL_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
BPRSL$Fitted <- fitted(BPRSL_ref2)

# Draw the plot with Fitted values of bprs

ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

# Number of subjects per group
n <- 20

# Summary data with mean and standard error of bprs by treatment and week
BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs) / sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Plot the mean profiles
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
  geom_line(color = "black") +
  scale_linetype_manual(values = c(1, 2)) +
  geom_point(size = 3, color = "black") +
  scale_shape_manual(values = c(1, 2)) +
  theme(legend.position = c(0.8, 0.8)) +
  scale_y_continuous(name = "mean(bprs) +/- se(bprs)")

# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline week 0)
BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise(mean = mean(bprs)) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")

# Find a suitable threshold value and use filter() to exclude the outlier
threshold <- 60
BPRSL8S1 <- BPRSL8S %>%
  filter(mean <= threshold)

# Draw a boxplot of the new data
ggplot(BPRSL8S1, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")

# Perform a two-sample t-test
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -4.232480  7.162085
## sample estimates:
## mean in group 1 mean in group 2 
##        36.16875        34.70395

With the outlier removed, the two-sample t-test gives t = 0.52 and p = 0.61, so there is no evidence of a difference in mean BPRS between the two treatment groups.
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
  mutate(baseline = BPRS$week0)

# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)

# Compute the analysis of variance table for the fitted model with anova()
anova_result <- anova(fit)
print(anova_result)
## Analysis of Variance Table
## 
## Response: mean
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## baseline   1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment  1    3.45    3.45  0.0557    0.8148    
## Residuals 37 2292.97   61.97                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows that the baseline BPRS value is a strongly significant predictor of the post-treatment mean (p = 3.1e-06), whereas treatment remains non-significant after conditioning on the baseline (p = 0.81): even with the baseline accounted for, there is no evidence of a difference between the two treatments.